feat: decoupled replay -- flow recording and independent NS3 replay by yanzhenghao · Pull Request #291 · aliyun/SimAI

yanzhenghao · 2026-06-18T05:45:36Z

Summary

Implements decoupled replay: SimAI captures flow metadata and timing during coupled simulation, then an independent NS3 binary replays from the flow file without linking any SimAI code.

SimAI side

MockNcclGroup: flow buffer accumulation, send/completion-time recording, completion-based relative_delay_ns. Per-rank sorted map resolves prev[] rank IDs to predecessor flow IDs via lower_bound.
AstraSimNetwork.cc: recordFlowSendTime() in sim_send(), explicit finalize between Run and Destroy
Sys.cc: finalizeFlowFile() in destructor (analytical safety net)
entry.h: recordFlowCompletionTime() in qp_finish()
common.h (3 mirrored copies, kept in sync): uint64_t relative_delay_ns field
check-common-h-consistency.sh: CI diff-check script

Independent binary

8 files under ns-3-alibabacloud/simulation/scratch/decoupled_replay/. SetConfig() is called explicitly after ReadConf() because the independent binary has no SimAI framework to apply NS3 defaults (QCN, PFC thresholds, CC mode).

Scheduling: layer constraint (hard gate) + relative_delay_ns (soft gate). No flow-level dependency graph. Causality fully encoded in completion-based timing from Phase 1.

Co-Authored-By: Claude <noreply@anthropic.com>

…ication

- Fix curl global init thread safety: use singleton CurlGlobalManager
- Fix cross-rack detection: use global_rank_rack_map_ instead of gpus_per_server_
- Initialize WorkloadConfig members with default values
- Optimize dependency tracking from O(n²) to O(n) using map lookup
- Add error return values to OxcFlowOutput functions
- Rename static debug counters for clarity
- Add DP workload test file
- Update design document with Mermaid diagrams

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove git submodules (SimCCL/aicb/ns-3-alibabacloud), so all code lives in one repo
- Add ranks field to LogItem CSV output with participating GPU rank IDs
- Add sidecar _rank_mapping.csv with full rank group decomposition
- Add rank_mapper.py: CommGroup-to-RankGenerator token bridge (7 group types)
- Add _fill_ranks() in WorkloadGenerator for automatic rank population
- Add Domain Flow Graph + Domain MsgSize Bar visualization charts
- Add per-rank CSV generation script (generate_per_rank_csv.py)
- Add 15 unit tests (rank_mapper + LogItem serialization)
- Fix LRA gate to support multiple concurrent in_progress features
- Deep Interview spec + Ralplan consensus plan artifacts

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

v3 field mapping: node_ip→node_id, a_node_ip→a_node_id, port_infos→port_id_list, port_name→port_id, server_type→chassis_topo
merger.py: N:N topology (OXC→spine→leaf fan-out, server→N leaves)
ns3_emitter.py: bandwidth from port_id (800GE→800Gbps), NPU from chassis_topo
edg_client.py: spine-aware mock crosses and smart adjustment
HomePage.tsx: frontend v3 auto-detect (server IP, bandwidth, NPU type)
lld_to_topology.py: v3 visualization with IP-based cell IDs
SimAI.conf: /etc→/tmp paths, +800Gbps rate map
SimAI_training_workload_generator.py: fix get_model_details() model→self.model
Tests: 99/99 pass, NS3 verified 8/16/32 GPU ALLREDUCE with AIOB workload

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
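A minimal sketch of how a link rate could be derived from a port_id, as in the 800GE→800Gbps mapping above. The port_id shape (a speed prefix such as "800GE" followed by a slot path) and the helper name `bandwidth_from_port_id` are assumptions made for illustration; the actual ns3_emitter.py logic is not shown in this PR page.

```python
import re
from typing import Optional

# Assumed port_id shape: "<speed>GE<slot path>", e.g. "800GE1/0/1".
_SPEED_PREFIX = re.compile(r"^(\d+)GE")

def bandwidth_from_port_id(port_id: str) -> Optional[str]:
    """Hypothetical helper: return e.g. "800Gbps" for "800GE1/0/1", else None."""
    match = _SPEED_PREFIX.match(port_id)
    if match is None:
        return None  # no recognizable speed prefix; the caller decides the fallback
    return f"{match.group(1)}Gbps"
```

Returning None rather than a default rate keeps an unrecognized port_id visible to the caller instead of silently assigning it a bandwidth.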

Replace spine_to_leaves[spine_ip] with spine_port_to_leaf[(spine_ip, spine_port)] for exact per-port edge resolution from LLDP data. No hardcoded formulas. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
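The difference between device-level and port-level lookup can be sketched as follows. Only the two dictionary names come from the commit message; the edge tuple layout, the sample addresses, and the builder function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical LLDP-derived edges: (spine_ip, spine_port, leaf_ip).
edges = [
    ("10.0.0.1", "p1", "10.0.1.1"),
    ("10.0.0.1", "p2", "10.0.1.2"),
    ("10.0.0.2", "p1", "10.0.1.1"),
]

def build_spine_to_leaves(edges):
    """Device-level aggregation: a spine maps to a list of leaves, ports are lost."""
    out = defaultdict(list)
    for spine_ip, _port, leaf_ip in edges:
        out[spine_ip].append(leaf_ip)
    return dict(out)

def build_spine_port_to_leaf(edges):
    """Port-level mapping: each (spine_ip, spine_port) pair maps to exactly one leaf."""
    return {(spine_ip, port): leaf_ip for spine_ip, port, leaf_ip in edges}

spine_to_leaves = build_spine_to_leaves(edges)
spine_port_to_leaf = build_spine_port_to_leaf(edges)
```

With the tuple key, resolving an edge needs no positional formula: the leaf behind a given spine port is read directly from the recorded LLDP data.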

…lient

Apply re.sub(r"\(\d+\)$", "", node_id) in _build_edge_maps, resolve_paths, _mock_baseline_crosses, and _smart_adjustment so that an OXC node_id and an edge a_node_id match whether or not either carries an "IP(0)"-style suffix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
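A small sketch of the normalization, assuming the intended pattern removes a trailing parenthesized index such as "(0)". The helper name `strip_index_suffix` and the sample IDs are hypothetical.

```python
import re

# Matches a parenthesized run of digits only at the very end of the ID.
_INDEX_SUFFIX = re.compile(r"\(\d+\)$")

def strip_index_suffix(node_id: str) -> str:
    """Hypothetical helper: drop a trailing "(N)" so both ID formats compare equal."""
    return _INDEX_SUFFIX.sub("", node_id)

# Both spellings normalize to the same key before being compared or used in a map.
same = strip_index_suffix("10.0.0.1(0)") == strip_index_suffix("10.0.0.1")
```

Because the pattern is anchored with `$`, parentheses that appear earlier in an ID are left untouched.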

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Migrate lld.json/init_crosses.json from the per-session workspace to global EDG_DATA_ROOT/{topology_dir}/ storage. wizard-store adds zustand/persist so that EDG graph data survives page refreshes.
- server/config.py: add EDG_DATA_ROOT config
- server/edg/routes.py: _edg_global_dir() + _edg_load() with global-first, workspace-fallback strategy. init writes to both stores.
- edg-api.ts + EdgPage.tsx: pass topologyDir to baseline-graph/register-task
- wizard-store.ts: persist EDG graph data to localStorage
- feature_list.json: F091 added, F090 marked done

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

v3 lld server node_id is a name (superpod#0_server#0), not an IP. Rename across 10 files: frontend types/stores/api/pages + backend routes/merger/tests. npu_match server_ip field preserved (external EDG protocol contract). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- network-store setActiveNetwork now resets wizard store and clears ocs-sim-wizard localStorage to prevent stale graph data leakage
- lld_to_topology.py now detects group_id from lld.json and generates per-group pod XMLs instead of 1 pod per input file (8 pods for 8 groups)
- Fix pre-existing ntype→node_type variable reference bug in generate_pod_xml / generate_pod_xml_with_crosses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Major changes:
- lld.json: rewrite to 8NPU×8port per server topology (64 leaf, 512 edges)
- ns3_emitter.py: each NPU→all leaves (not round-robin 1:1)
- merger.py: fix IP string sort→numeric sort for leaf ordering
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename (v3 lld uses names not IPs)
- network-store.ts: localStorage migration + network switch reset
- routes.py: global EDG store + server_ids params
- Various OXC/NS3 C++ fixes from previous sessions

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
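The string-sort versus numeric-sort problem for leaf ordering can be illustrated with the standard `ipaddress` module. The sample addresses are made up, and this is only a sketch of the general technique, not the code in merger.py.

```python
import ipaddress

leaf_ips = ["10.0.0.10", "10.0.0.2", "10.0.0.1"]

# String sort compares character by character, so "10.0.0.10" lands before "10.0.0.2".
string_sorted = sorted(leaf_ips)

# Numeric sort compares the parsed address value instead.
numeric_sorted = sorted(leaf_ips, key=ipaddress.ip_address)
```

Any code that assigns leaf indices by position in the sorted list gets a different, and usually unintended, ordering under the string sort as soon as an octet reaches two digits.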

- ns3_emitter.py: add unfolded mode (explicit spine/OXC switch nodes) via optional lld param. Leaf→Spine + Spine→OXC replaces Leaf↔Leaf. Backward compatible: no lld = folded mode.
- merger.py: fix IP string sort→numeric sort for leaf ordering
- lld.json: rewrite to 8NPU×8port×8leaf per server topology
- F091: EDG init global persistence (EDG_DATA_ROOT)
- F092: serverIps→serverIds rename + localStorage migration

Unfolded topology: 35 nodes (16NPU+16Leaf+2Spine+1OXC), 202 links. Cross-server path: NPU→Leaf→Spine→OXC→Spine→Leaf→NPU (7 hops). Single OXC avoids NS3 multi-path routing loops.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ment + WorkloadPage stale EDG cleanup

- ProcessList: add View cmd / View error expandable buttons with return_code + error_message display, color-coded status badges
- ns3_emitter: raise RuntimeError when lld has spine/OXC but unfolding produces zero links; no more silent fallback to folded mode
- WorkloadPage: clear edgTopologyPath/BaselineGraph/AdjustedGraph/Diff on new workload generation to prevent stale EDG topology leak

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… inference, CP, Chakra, tensor graph

Complete AICB workload generator extensibility implementation (F093-F099):

F093: Model registry -- MODEL_REGISTRY dict + _bootstrap.py registration replaces hardcoded if/elif dispatch in generate_megatron_workload.py
F094: LLaMA MockedModel -- MockedLlama.py (539 lines): GQA + SwiGLU + RMSNorm pre-norm, reuses MegatronColumn/RowLinear for TP. Supports LLaMA 2/3/4 configs (7B through 70B, dense and MoE).
F095: Parameterized MoE routing -- --n_shared_expert moved from DeepSeek-only to get_moe_params (all MoE models). MOEMLP shared expert computation.
F096: Qwen3 inference -- MockedQwen3Moe.py (344 lines, 8 classes) + MockedQwen3Next.py (287 lines, 4 classes + GatedDeltaNet).
F097: Context Parallelism -- CommGroup.cp_group + ContextParallelRing (110 lines) for ring-attention isend/irecv between CP neighbors.
F098: Chakra output format -- ChakraWriter (178 lines) converts AICB LogItem to MLCommons Chakra JSON schema (COMP_ONLY/COMM_COLL/COMM_SEND/COMM_RECV).
F099: Declarative tensor graph -- tensor_graph package (345 lines): TensorGraph CSV load/dump, ReplicateGraph layer stacking, ConnectGraph port wiring. SwiGLU FFN 8-line CSV template as proof-of-concept.

Also: registry.py (119 lines) + _bootstrap.py (103 lines) infrastructure, --num_kv_heads CLI arg for GQA architectures, test_registry.py and test_mocked_llama.py test files.

Research: research_aicb_extensibility.md (491 lines) -- STAGE paper analysis, PARAM/Chakra comparison, 2025-2026 model parallel strategy survey.

21 files, +3662/-130

Co-Authored-By: Claude <noreply@anthropic.com>

…g, Qwen3 inference, CP, Chakra, tensor graph" This reverts commit 8604a3c.

- Fetch simulation progress via fetchProgress API for running processes
- Show progress bar with percentage, layer count, and ETA
- Extract and display workload filename from command line (-w argument)
- Restructure layout into two-row format (status+PID+buttons / progress bar)
- Add formatETA and extractWorkloadName helper functions

Co-Authored-By: Claude <noreply@anthropic.com>

All three use LLaMA-compatible architecture (RMSNorm + SwiGLU + GQA + RoPE), reuse existing MegatronModel workload generator. Verified parameters from deep-research HF config.json analysis. Co-Authored-By: Claude <noreply@anthropic.com>

…letion-based timing

Adds flow recording instrumentation for NS3 decoupled replay:
- MockNcclGroup: flow buffer accumulation, send-time & completion-time recording, deferred finalizeFlowFile() with completion-based relative_delay_ns
- AstraSimNetwork.cc: recordFlowSendTime in sim_send, explicit finalize in main
- Sys.cc: finalizeFlowFile call in destructor (analytical mode safety net)
- common.h: relative_delay_ns field in FlowRecord
- scripts/check-common-h-consistency.sh: CI diff-check for 3 common.h copies

relative_delay_ns = send_time - max(prev completion times), clamped to 0. No flow-level dependency graph -- causality fully encoded in timing.

Co-Authored-By: Claude <noreply@anthropic.com>

Independent binary (scratch/decoupled_replay/): 8 files, 2,596 lines
- Zero SimAI linkage (nm check target)
- DepScheduler with prev[] dependency graph + layer_num constraint
- flow_reader.h parses complete 21-field format
- Whitelisted in scratch/.gitignore

CI: scripts/check-common-h-consistency.sh

Co-Authored-By: Claude <noreply@anthropic.com>

Plan and all documentation reference scratch_decoupled_replay. CMakeLists.txt had scratch_decoupled_replay_main which would break nm verification commands. Co-Authored-By: Claude <noreply@anthropic.com>

prev[] contains rank IDs (0,1,2...) but _flow_completion_times is keyed by flow ID (10423,10424...). The naive _flow_completion_times.count(pid) always failed for ring allreduce flows, causing every flow to fall back to relative_delay_ns = absolute send_time. Build a per-rank sorted map of (flow_id, completion_time) pairs, then resolve each prev rank ID to the most recent predecessor flow from that rank via lower_bound. This correctly computes relative_delay_ns as send_time - max(predecessor completion times).

Co-Authored-By: Claude <noreply@anthropic.com>
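The predecessor lookup described in this fix can be sketched as follows, in Python for brevity even though the PR implements it in C++ inside MockNcclGroup. This is a rough illustration under stated assumptions: flow IDs are taken to increase over time, so the "most recent predecessor" from a rank is its entry with the largest flow ID below the current one. The function name, argument names, and the `completed_by_rank` layout are hypothetical; `bisect_left` plays the role of `lower_bound`.

```python
from bisect import bisect_left

def compute_relative_delay_ns(flow_id, send_time_ns, prev_ranks, completed_by_rank):
    """Return send time minus the latest predecessor completion time, clamped to 0.

    completed_by_rank maps rank -> list of (flow_id, completion_time_ns) tuples
    sorted by flow_id. If no predecessor flow is found for any rank in
    prev_ranks, the absolute send time is returned as the fallback.
    """
    latest_done_ns = None
    for rank in prev_ranks:
        flows = completed_by_rank.get(rank, [])
        # lower_bound: index of the first entry whose flow_id is >= flow_id.
        idx = bisect_left(flows, (flow_id,))
        if idx == 0:
            continue  # this rank has no flow earlier than the current one
        _, done_ns = flows[idx - 1]  # most recent predecessor from this rank
        if latest_done_ns is None or done_ns > latest_done_ns:
            latest_done_ns = done_ns
    if latest_done_ns is None:
        return send_time_ns
    return max(0, send_time_ns - latest_done_ns)

# Made-up data: prev[] holds rank IDs (0, 1), while completions are keyed by flow ID.
completed = {0: [(10420, 900), (10423, 1500)], 1: [(10421, 1200)]}
delay = compute_relative_delay_ns(10424, 2000, [0, 1], completed)
```

Here `delay` is 500: the latest predecessor completion is 1500 ns (flow 10423 on rank 0) and the send time is 2000 ns. Looking up the rank ID directly in a flow-ID-keyed map, as the earlier code did, would find nothing and fall through to the absolute send time.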

Cherry-picked reverted commits from submodule reflog:
- feat: decoupled replay Phase 2 (1598bbc)
- fix: GPUType enum, fct_writer format string (f0e19bb)
- refactor: inline SendFlow, remove _QPS_PER_CONNECTION_ (6843092)
- fix: sequential step numbering in main.cc (64a3613)

CLAassistant · 2026-06-18T05:45:47Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Anthony
❌ yanzhenghao

Anthony seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Anthony and others added 30 commits April 17, 2026 17:54

Add SimAI-OXC integration for optical cross-connect collective commun…

0d602ce

…ication

Add OXC build directory to gitignore

1c89435

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove OXC build artifacts from repository

be96a65

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

feat: add server and dashboard directories

51bf547

fix: port-level spine-to-leaf mapping, remove device-level aggregation

6956715

fix: strip edge node_ids in resolve_paths srv_leaf_count loop

965cb70

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Revert "feat: AICB extensibility -- model registry, LLaMA, MoE routin…

2ca91ad

…g, Qwen3 inference, CP, Chakra, tensor graph" This reverts commit 8604a3c.

chore: update ns-3-alibabacloud submodule pointer for decoupled replay

9ab14e1

fix: rename binary target to scratch_decoupled_replay for consistency

4b62f6f

chore: update ns-3-alibabacloud submodule for binary rename

49b2a58

Co-Authored-By: Claude <noreply@anthropic.com>

Merge branch 'feat/decoupled-replay-phase1'

7898c7f

chore: sync state files and dep_scheduler refinements

32af22f

Co-Authored-By: Claude <noreply@anthropic.com>

yanzhenghao and others added 2 commits June 18, 2026 13:46

chore: sync OMC state files

f1082ce

Co-Authored-By: Claude <noreply@anthropic.com>

chore: sync state

55d583c

yanzhenghao wants to merge 32 commits into aliyun:master from yanzhenghao:feat/decoupled-replay-phase1
